Introduction

In this project we use a dataset of wine reviews to predict review points from numerical, categorical, and textual predictors.

The data comes from Kaggle Datasets and covers 150k wine reviews along with some attributes of each wine. It can be found on Kaggle (a free Kaggle login is required to access it directly from kaggle.com). The data was originally scraped from WineEnthusiast.

The dataset contains the following columns:

This is a particularly interesting problem for several reasons:

Methods

Feature Engineering

## [1] "points"           "price"            "continentTopFive"
## [4] "topic1"           "sentiment"

Training and Test Set Generation

We create an 80% training, 20% test split for later use in cross-validation.
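The split can be sketched as follows (the seed and object names are assumptions, not the values used in the project):

```r
# A minimal sketch of the 80/20 train/test split (seed and names assumed)
set.seed(42)
n        <- nrow(wine)
trn_idx  <- sample(n, size = floor(0.8 * n))
wine_trn <- wine[trn_idx, ]
wine_tst <- wine[-trn_idx, ]
```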

Model Selection

Model 1

## 
##  studentized Breusch-Pagan test
## 
## data:  mod1
## BP = 2824.7, df = 20, p-value < 2.2e-16
## 
##  Shapiro-Wilk normality test
## 
## data:  sample(resid(mod1), 5000)
## W = 0.91738, p-value < 2.2e-16
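The diagnostic output above (and for the models that follow) can be reproduced roughly as below, assuming a fitted `lm` object `mod1`:

```r
# Assumption-testing sketch for a fitted lm object `mod1`
library(lmtest)

bptest(mod1)                              # studentized Breusch-Pagan test
set.seed(1)                               # shapiro.test() accepts at most
shapiro.test(sample(resid(mod1), 5000))   # 5000 values, so sample residuals
```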

Model 2

## 
##  studentized Breusch-Pagan test
## 
## data:  mod2
## BP = 5604, df = 21, p-value < 2.2e-16
## 
##  Shapiro-Wilk normality test
## 
## data:  sample(resid(mod2), 5000)
## W = 0.99597, p-value = 1.765e-10

Model 3

## 
##  studentized Breusch-Pagan test
## 
## data:  mod3
## BP = 4521.8, df = 21, p-value < 2.2e-16
## 
##  Shapiro-Wilk normality test
## 
## data:  sample(resid(mod3), 5000)
## W = 0.99349, p-value = 2.522e-14

Model 4

## 
##  studentized Breusch-Pagan test
## 
## data:  mod4
## BP = 4464.5, df = 21, p-value < 2.2e-16
## 
##  Shapiro-Wilk normality test
## 
## data:  sample(resid(mod4), 5000)
## W = 0.99624, p-value = 5.742e-10

Results

rmse_data
##         CV RMSE
## model1 2.561139
## model2 2.433134
## model3 2.450683
## model4 2.532468
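The cross-validated RMSE values above could be computed with a helper along these lines. The fold count and object names are assumptions; only the general k-fold pattern is shown:

```r
# Hypothetical k-fold CV RMSE helper (k, formula, and data names assumed)
cv_rmse <- function(form, data, k = 10) {
  folds <- sample(rep(1:k, length.out = nrow(data)))
  fold_mse <- sapply(1:k, function(i) {
    fit  <- lm(form, data = data[folds != i, ])
    pred <- predict(fit, newdata = data[folds == i, ])
    mean((data[folds == i, "points"] - pred)^2)
  })
  sqrt(mean(fold_mse))
}
```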

Discussion

We used four different strategies to attempt to meet the assumptions of multiple linear regression while minimizing cross-validated RMSE: transforming the response with Box-Cox and log transformations, fitting polynomial models over the engineered features, and weighted least squares regression. In the end, the model with the untransformed response had the lowest cross-validated RMSE, at 2.433134.
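The transformation and WLS strategies mentioned above might look like the following. These are sketches only: the model formulas and weight specification are placeholders, not the exact specifications fit in this project.

```r
# Sketches of the Box-Cox and WLS strategies (specifications assumed)
library(MASS)

# Profile the Box-Cox likelihood to choose a response transformation
boxcox(mod2, lambda = seq(-2, 2, by = 0.1))

# Weighted least squares: weights from an auxiliary variance model
w       <- 1 / fitted(lm(abs(resid(mod2)) ~ fitted(mod2)))^2
mod_wls <- lm(formula(mod2), data = wine_trn, weights = w)
```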

Our coefficient results are as follows:

coef(summary(mod2))
##                                             Estimate Std. Error
## (Intercept)                              70.73392019  4.4377394
## log(price)                                6.45145661  0.6792085
## continentTopFiveAsia                    -16.50523848 10.2899555
## continentTopFiveAustralia               -11.72544870  4.4348757
## continentTopFiveEuropeTop5               -3.96254922  4.0460673
## continentTopFiveNorthAmerica              6.07493460  4.0362483
## continentTopFiveOtherEurope              -0.28473634  4.9468251
## continentTopFiveSouthAmerica            -14.87145016  4.2551687
## topic1                                   16.10546421  8.8309836
## log(price):continentTopFiveAsia          -1.00496932  0.3779971
## log(price):continentTopFiveAustralia      0.12886966  0.1699013
## log(price):continentTopFiveEuropeTop5    -0.03923637  0.1573030
## log(price):continentTopFiveNorthAmerica   0.29455211  0.1576415
## log(price):continentTopFiveOtherEurope    0.10960308  0.1859945
## log(price):continentTopFiveSouthAmerica   0.39977604  0.1655958
## log(price):topic1                        -6.98679668  1.3173636
## continentTopFiveAsia:topic1              36.34992976 20.0892397
## continentTopFiveAustralia:topic1         22.46995770  8.8227125
## continentTopFiveEuropeTop5:topic1         7.70128325  8.0496260
## continentTopFiveNorthAmerica:topic1     -15.87112737  8.0286509
## continentTopFiveOtherEurope:topic1        0.04715288  9.8293135
## continentTopFiveSouthAmerica:topic1      25.96146830  8.4714437
##                                             t value     Pr(>|t|)
## (Intercept)                             15.93917850 4.043829e-57
## log(price)                               9.49849174 2.178339e-21
## continentTopFiveAsia                    -1.60401456 1.087143e-01
## continentTopFiveAustralia               -2.64391824 8.196639e-03
## continentTopFiveEuropeTop5              -0.97935820 3.274056e-01
## continentTopFiveNorthAmerica             1.50509438 1.323033e-01
## continentTopFiveOtherEurope             -0.05755941 9.540997e-01
## continentTopFiveSouthAmerica            -3.49491435 4.744376e-04
## topic1                                   1.82374523 6.819385e-02
## log(price):continentTopFiveAsia         -2.65866962 7.846347e-03
## log(price):continentTopFiveAustralia     0.75849736 4.481552e-01
## log(price):continentTopFiveEuropeTop5   -0.24943173 8.030274e-01
## log(price):continentTopFiveNorthAmerica  1.86849366 6.169648e-02
## log(price):continentTopFiveOtherEurope   0.58928117 5.556741e-01
## log(price):continentTopFiveSouthAmerica  2.41416835 1.577312e-02
## log(price):topic1                       -5.30362053 1.137905e-07
## continentTopFiveAsia:topic1              1.80942287 7.038861e-02
## continentTopFiveAustralia:topic1         2.54683099 1.087223e-02
## continentTopFiveEuropeTop5:topic1        0.95672559 3.387083e-01
## continentTopFiveNorthAmerica:topic1     -1.97681126 4.806596e-02
## continentTopFiveOtherEurope:topic1       0.00479717 9.961724e-01
## continentTopFiveSouthAmerica:topic1      3.06458606 2.180347e-03

Normality of the errors and homoskedasticity are both suspect, as the Breusch-Pagan and Shapiro-Wilk tests demonstrate. The residual variance appears highest for points near 90 and tapers toward the extremes. However, because `shapiro.test` in R accepts at most 5,000 observations, the residuals must be sampled, and the test result therefore depends on the random seed used for sampling.

We sidestep the normality and homoskedasticity assumptions by using the cross-validated RMSE to construct empirical prediction intervals. Doing so achieves a 94% coverage rate on the withheld test set with a +/- 4.8-point prediction interval.
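Such an interval can be sketched as a normal approximation around the point prediction, with the half-width 1.96 × 2.433134 ≈ 4.8 points (object names assumed):

```r
# Empirical prediction interval from the CV RMSE (names assumed)
pred  <- predict(mod2, newdata = wine_tst)
half  <- 1.96 * 2.433134                   # roughly +/- 4.8 points
covered <- wine_tst$points >= pred - half &
           wine_tst$points <= pred + half
mean(covered)                              # empirical coverage on the test set
```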

Potential future improvements include deeper NLP, additional predictors, and a broader model search. It may also be that the data is inherently noisy, with a high irreducible \(\epsilon\) variance in points regardless of the predictors used (these are subjective human judgments, after all).

Appendix

About the LDA